The Development of Dutch and Afrikaans Language Resources for Compound Boundary Analysis

Authors

Menno van Zaanen

Gerhard B. Van Huyssteen

Suzanne Aussems

Chris Emmery

Roald Eiselen

Abstract

In most languages, new words can be created through the process of compounding, which combines two or more words into a new lexical unit. Whereas in languages such as English the components that make up a compound are separated by a space, in languages such as Finnish, German, Afrikaans and Dutch these components are concatenated into one word. Compounding is very productive and leads to practical problems in developing machine translators and spelling checkers, as newly formed compounds cannot be found in existing lexicons. The Automatic Compound Processing (AuCoPro) project deals with the analysis of compounds in two closely related languages, Afrikaans and Dutch. In this paper, we present the development and evaluation of two datasets, one for each language, that contain compound words with annotated compound boundaries. Such datasets can be used to train classifiers to identify the compound components in novel compounds. We describe the process of annotation and provide an overview of the annotation guidelines as well as global properties of the datasets. The inter-annotator agreement was considered highly reliable. Furthermore, we show the usability of these datasets by building an initial automatic compound boundary detection system, which assigns compound boundaries with approximately 90% accuracy.
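
As a rough illustration of how boundary-annotated compounds could be prepared for a classifier and scored, the sketch below converts annotated words into per-gap boundary labels and computes a word-level accuracy. The `+` boundary marker, the example words, the function names and the choice of word-level accuracy are all assumptions made for illustration; the abstract does not specify the AuCoPro annotation format, the classifier, or how the reported accuracy was measured.

```python
from typing import List, Set, Tuple

BOUNDARY = "+"  # assumed marker; the actual annotation scheme may differ


def parse_annotated(annotated: str) -> Tuple[str, Set[int]]:
    """Split an annotated compound into the plain word and its boundary offsets.

    A boundary offset i means a boundary sits between characters i-1 and i
    of the plain word.
    """
    chars: List[str] = []
    boundaries: Set[int] = set()
    for ch in annotated:
        if ch == BOUNDARY:
            boundaries.add(len(chars))
        else:
            chars.append(ch)
    return "".join(chars), boundaries


def position_labels(annotated: str) -> List[int]:
    """One binary label per gap between adjacent characters (1 = boundary)."""
    word, boundaries = parse_annotated(annotated)
    return [1 if i in boundaries else 0 for i in range(1, len(word))]


def word_accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of words whose predicted boundaries all match the gold ones."""
    correct = sum(
        parse_annotated(g) == parse_annotated(p) for g, p in zip(gold, predicted)
    )
    return correct / len(gold)


if __name__ == "__main__":
    # Hypothetical examples: Dutch "fietsband", Afrikaans "boekwinkel".
    gold = ["fiets+band", "boek+winkel"]
    predicted = ["fiets+band", "boe+kwinkel"]
    print(position_labels(gold[0]))
    print(word_accuracy(gold, predicted))
```

Labelling each gap between characters turns boundary detection into a sequence of binary decisions, which is one common way to frame the task for a classifier; other framings are equally possible.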

Similar resources

Rapid rule-based machine translation between Dutch and Afrikaans

This paper describes the design, development and evaluation of a machine translation system between Dutch and Afrikaans developed over a period of around a month and a half. The system relies heavily on the re-use of existing publicly available resources such as Wiktionary, Wikipedia and the Apertium machine translation platform. A method of translating compound words between the languages by...

Developing a Broadband Automatic Speech Recognition System for Afrikaans

Afrikaans is one of the eleven official languages of South Africa. It is classified as an under-resourced language. No annotated broadband speech corpora currently exist for Afrikaans. This article reports on the development of speech resources for Afrikaans, specifically a broadband speech corpus and an extended pronunciation dictionary. Baseline results for an ASR system that was built using ...

Cross-Lingual Genre Classification for Closely Related Languages

Resource scarcity is a topic that is continually researched by the HLT community, especially for the South African context. We explore the possibility of leveraging existing resources to help facilitate the development of new resources for under-resourced languages by using cross-lingual classification methods. We investigate the application of an Afrikaans genre classification system on Dutch t...

Classification of Noun-Noun Compound Semantics in Dutch and Afrikaans

This article presents initial results on a supervised machine learning approach to determine the semantics of noun compounds in Dutch and Afrikaans. After a discussion of previous research on the topic, we present our annotation methods used to provide a training set of compounds with the appropriate semantic class. The support vector machine method used for this classification experiment utili...

AfriBooms: An Online Treebank for Afrikaans

Compared to well-resourced languages such as English and Dutch, natural language processing (NLP) tools for Afrikaans are still not abundant. In the context of the AfriBooms project, KU Leuven and the North-West University collaborated to develop a first, small treebank, a dependency parser, and an easy to use online linguistic search engine for Afrikaans for use by researchers and students in ...

Publication date: 2014
